Addition Similarity



In [1]:

    
# Import libraries
import numpy as np
import pandas as pd
# Import the data
import WTBLoad
wtb = WTBLoad.load()

Question: How similar are two additions? For instance, I'm thinking of brewing a beer with plums and vanilla, and I want to know how alike they are.

How to get there: The dataset shows the percentage of votes that said a style-addition combo would likely taste good. So, we can compare the votes on each style for the two additions, and see how similar they are.
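As a quick illustration of that comparison, here is a minimal sketch on a made-up table. The styles, additions, and vote shares below are hypothetical and not taken from the real dataset; `mean_squared_difference` is just an illustrative helper name.

```python
import pandas as pd

# Hypothetical vote shares: one row per style, one column per addition.
toy = pd.DataFrame(
    {
        "apple": [0.50, 0.25, 0.75],
        "pear": [0.50, 0.25, 0.50],
        "smoke": [0.00, 0.75, 0.25],
    },
    index=["stout", "ipa", "saison"],
)

def mean_squared_difference(frame, a, b):
    # Difference per style, squared so that sign does not matter, then averaged.
    return ((frame[a] - frame[b]) ** 2).mean()

apple_pear = mean_squared_difference(toy, "apple", "pear")
apple_smoke = mean_squared_difference(toy, "apple", "smoke")
print(apple_pear, apple_smoke)
```

In this toy table apple and pear agree on two of the three styles, so their score comes out much lower than the score for apple and smoke, which disagree on every style.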



In [2]:

    
# Square the per-style difference between the two additions, then average it.
# The result is the mean squared difference, so despite the name it is a
# distance: lower means more similar, higher means more different.
def similarity(additionA, additionB):
    diff = np.square(wtb[additionA] - wtb[additionB])
    return diff.mean()

res = []
# Loop through each addition pair
for additionA in wtb.columns:
    for additionB in wtb.columns:
        # Keep each unordered pair once: requiring A to sort before B
        # alphabetically also skips pairing an addition with itself.
        if additionA < additionB:
            res.append([additionA, additionB, similarity(additionA, additionB)])
df = pd.DataFrame(res, columns=["additionA", "additionB", "similarity"])
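
The nested loop above visits every ordered pair and then filters. An alternative sketch, assuming the same one-column-per-addition layout, uses `itertools.combinations` to generate each unordered pair exactly once. The `pairwise_scores` helper and the toy data are hypothetical, for illustration only.

```python
from itertools import combinations

import pandas as pd

def pairwise_scores(frame):
    # Sorting the column names first keeps additionA alphabetically before additionB.
    rows = [
        (a, b, ((frame[a] - frame[b]) ** 2).mean())
        for a, b in combinations(sorted(frame.columns), 2)
    ]
    return pd.DataFrame(rows, columns=["additionA", "additionB", "similarity"])

# Hypothetical vote shares for three additions across two styles.
toy = pd.DataFrame(
    {"pear": [0.5, 0.25], "apple": [0.5, 0.75], "smoke": [0.0, 0.75]}
)
pairs = pairwise_scores(toy)
print(pairs)
```

With n columns this yields n(n-1)/2 rows. The count of 1485 pairs reported by `df.describe()` later in this notebook equals 55 × 54 / 2, which is what 55 columns would produce.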

Top 10 most similar additions



In [3]:

    
df.sort_values("similarity").head(10)









    Out[3]:

         additionA   additionB        similarity
    530  chamomile   rose hips        0.011956
    403  bourbon     whiskey          0.013294
    962  grapefruit  lemon grass      0.013347
    928  ginger      juniper berries  0.013454
    514  chamomile   lemon pepper     0.013545
    88   apple       pear             0.013556
    297  blackberry  raspberry        0.013563
    501  chamomile   coriander        0.014286
    265  blackberry  cherry           0.014319
    529  chamomile   rhubarb          0.014383

Top 10 least similar additions



In [4]:

    
df.sort_values("similarity", ascending=False).head(10)









    Out[4]:

          additionA    additionB    similarity
    1432  red wine     rye          0.159639
    1243  oak          red wine     0.152291
    1238  oak          piña colada  0.146401
    1264  orange peel  red wine     0.145567
    876   cucumber     port         0.143464
    246   basil        port         0.142268
    1372  piña colada  rye          0.139274
    1366  piña colada  port         0.137152
    1405  port         watermelon   0.132705
    1434  red wine     smoke        0.129498

Similarity of a specific combo



In [5]:

    
def comboSimilarity(additionA, additionB):
    # Pairs are stored with additionA before additionB alphabetically,
    # so swap the arguments if they arrive in the other order.
    if additionA > additionB:
        additionA, additionB = additionB, additionA
    # Combine both conditions into one boolean mask instead of chaining .loc
    return df[(df['additionA'] == additionA) & (df['additionB'] == additionB)]
comboSimilarity('plum', 'vanilla')









    Out[5]:

          additionA  additionB  similarity
    1391  plum       vanilla    0.050466

But is that good or bad? How does it compare to others?



In [6]:

    
df.describe()









    Out[6]:

           similarity
    count  1485.000000
    mean      0.043910
    std       0.025011
    min       0.011956
    25%       0.025427
    50%       0.037313
    75%       0.053579
    max       0.159639

A higher score means the two additions are less alike. The plum and vanilla score (0.050466) is above the mean (0.043910) and much closer to the 75th percentile (0.053579) than to the median (0.037313), so the pair is less similar than most. Since the two tend not to be rated well in the same styles, it's probably not a combo that will be great together.
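
To put a single number on where a score falls, one option is the share of pairs with a strictly lower score. This is a sketch only: `fraction_more_similar` is a hypothetical helper, and the scores below are made up rather than taken from the real results.

```python
import pandas as pd

def fraction_more_similar(scores, value):
    # Share of pairs whose score is strictly lower (more similar) than `value`.
    return (scores < value).mean()

# Made-up scores standing in for the notebook's df["similarity"] column.
toy_scores = pd.Series([0.01, 0.02, 0.03, 0.05, 0.09])
share = fraction_more_similar(toy_scores, 0.05)
print(share)
```

In the notebook itself the call would be `fraction_more_similar(df["similarity"], 0.050466)`. Given the quartiles above, the result should land somewhere between 0.5 and 0.75.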
